Athena
Summary
This document covers the information to gather from Athena in order to configure a Qarbine data service. The data service will use the Qarbine Athena driver. You can define multiple data services that access the same Databricks Athena though with varying credentials. Once a data service is defined, you can manage which Qarbine principals have access to it and its associated Athena data. A Qarbine administrator has visibility to all data services.
Overview
AWS Athena is a query service used to analyze data directly in Amazon Simple Storage Service (S3) using standard SQL. More information on Athena can be found at https://docs.aws.amazon.com/athena/latest/ug/what-is.html. From that page,
In Athena, tables and databases are containers for the metadata definitions that define a schema for underlying source data. For each dataset, a table needs to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example, column names, data types, and the name of the table. Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset.
Of particular importance is that Athena supports data structures such as mapped documents and arrays. It is not limited to 2 dimensional rows with simple numbers, dates, and strings. When querying, Qarbine automatically converts the underlying map and array JSON string values into their appropriate rich data structures. There is no casting or unwinding necessary within the SQL unless it is needed as part of the WHERE clause. Athena query results are briefly stored into a writable S3 location as part of the Athena query life cycle.
A useful starting page for Athena can be found at
and information about Athena JSON interactions can be found at
There must be a Qarbine compute node with the Qarbine Athena data driver installed. Athena must be appropriately configured and populated with data. The Qarbine queries are run in an ad hoc fashion so the Athena data may be dynamic. Amazon Athena automatically stores query results and metadata information for each query that runs in a query result location that you can specify in Amazon S3.
Athena Configuration
Overview
From the AWS console navigate to the Athena main page at https://console.aws.amazon.com/athena/. From this page identify the name of the Athena catalog and database containing the data you would like to query.
Qarbine uses an API key to access Athena’s query services. When defining an Athena user choose the option shown below.
The associated IAM role/user must have AmazonAthenaFullAccess and AmazonS3FullAccess policies attached. If necessary navigate from the AWS console navigate to the Identity and Access Management area. Define a user with the permissions directly or by referencing an appropriate group policy. Shown below is a user with the policies directly attached.
Next create an API key by going to the tab highlighted below.
Click
.
You should see
Gather the API key information from the dialog which will be needed shortly.
Close the dialog.
IAM Considerations
The full access permissions for S3 and Athena seems a bit much. The basis for this came from
https://aws.amazon.com/blogs/apn/using-athena-express-to-simplify-sql-queries-on-amazon-athena/
Since Athena writes its output to S3 we need permission to write somewhere within S3. A permission tied to a particular S3 location will be safer. Outside of basic querying, qarbine does list catalogs, databases, and tables along with asking about table structures. See the section below for an example IAM policy. ****
Qarbine Configuration
Compute Node Preparation
Determine which compute node service endpoint you want to run this data access from. That URL will go into the Data Service’s Compute URL field. Its form is “https://domain:port/dispatch”. A sample is shown below.
The port number corresponds to a named service endpoint configured on the given target host. For example, the primary compute node usually is set to have a ‘main’ service. That service’s configuration is defined in the ˜./qarbine.service/config/service.main.json file. Inside that file the following driver entry is required
"drivers" :[
. . .
"./driver/athenaDriver.js"
]
The relevant configuration file name for non primary (main) Qarbine compute nodes is service.NAME.json. Remember to have well formed JSON syntax or a startup error is likely to occur. If you end up adding that entry then restart the service via the general command line syntax
pm2 restart <service>
For example,
pm2 restart main
or simply
pm2 restart all
Data Service Definition
Open the Administration Tool.
Navigate to the Data Services tab.
A data service defines on what compute node a query will run by default along with the means to reach to target data. The latter includes which native driver to use along with settings corresponding to that driver. Multiple Data Sources can reference a single Data Service. The details of any one Data Service are thus maintained in one spot and not spread out all over the place in each Data Source. The latter is a maintenance and support nightmare.
To begin adding a data service click
On the right hand side enter a data service name and description. Also choose the “AWS_Athena” driver. The Compute URL can likely be left as the default for the default setting.
Also choose the AWS_Athena driver.
The next set of values comes from your Athena installation noted above.
For the “Server Options” specify your AWS region, a writeable S3 location for your query results, your AWS access key ID and your AWS secret access key values. They are specified as comma separated “key=value” pairs.
The data options are used to indicate the Athena catalog to query. Currently only the catalog key is required.
The database value is the name of your Athena database.
You can reference environment variables using the syntax %NAME%. Any strings should be quoted and the key\value pairs separated by commas.
Test your settings by clicking on the toolbar image highlighted below.
The result should be
Save the Data Service by clicking on the image highlighted below.
The data service will be known at the next log on time. Next, see the Athena query interaction and any tutorial for information on interacting with Athena from Qarbine.
IAM Policy Example
This example is modeled after the AWSQuicksightAthenaAccess policy. Refer to the AWS online documentation for the latest details.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:GetCatalogs",
"athena:GetNamespace",
"athena:GetNamespaces",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:GetTable",
"athena:GetTables",
"athena:ListQueryExecutions",
"athena:RunQuery",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
}
]
}
Sample Athena Data
Data Row 1
{"orderno": 1, "orderdate" : "2020-05-24", header: {"orderby": "tom", "amount": 59.50}, details:[{detailno:1, cost:5.95, partnumber:"xd03410",linecode:"widget", qty:10} ] }
Data Row 2
{"orderno": 2, "orderdate" : "2020-06-14", header: {"orderby": "sally", "amount": 75}, details:[{detailno:1, cost:2.50, partnumber:"t4704X5",linecode:"thingy", qty:10},{detailno:2, cost:5.00, partnumber:"9561FHY",linecode:"thinga ma bob", qty:10} ] }
Structure Definition
CREATE EXTERNAL TABLE order_plus (
orderno int, orderdate date,
header struct<orderby:string,amount:double>,
details array<struct<detailno:int,cost:double,partnumber:string,linecode:string,qty:int>>
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://MY_BUCKET/'